UmbriaPress Corpus – Lorenzo Mattioli

UmbriaPress is a corpus of press releases from three major news outlets in Umbria: Corriere dell’Umbria, PerugiaToday, and Terninrete.

Expected use cases are:

NLP modelling exercises and training
Public discourse analysis

The complete corpus can be downloaded at this link. All resources (including the source code for the scrapers and the visualisation below) are available by clicking on the Code box at the top of this page.

Corpus description

The articles are categorised by city (Terni/Perugia) and timestamped. Since PerugiaToday assigns a main tag to each article, I included those tags in the dataset as well.

The corpus contains 168,528 articles in total. It is, however, rather heterogeneous (and thus potentially biased) in terms of regional coverage and language type. The two “fast journalism” outlets (PerugiaToday, Terninrete) dominate the corpus, making the more traditionally managed Corriere dell’Umbria almost invisible. Perugia is vastly overrepresented, and there is a visible bias towards more recent articles.
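Before modelling, it may be worth quantifying these imbalances on your own copy of the data. The following is a minimal sketch, not the code used to build the corpus: it assumes each article has been loaded as a dictionary with hypothetical fields `outlet`, `city` and `date` (ISO format), which may not match the actual column names in the download. The toy records are invented for illustration only and do not reflect the real proportions.

```python
from collections import Counter


def shares(records, key):
    """Return {value: fraction of records}, ordered from most to least frequent.

    `key` is a function that extracts the grouping value from one record.
    """
    counts = Counter(key(r) for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.most_common()}


# Toy records, invented for illustration; field names are assumptions.
toy = [
    {"outlet": "PerugiaToday", "city": "Perugia", "date": "2021-03-01"},
    {"outlet": "PerugiaToday", "city": "Perugia", "date": "2022-07-15"},
    {"outlet": "Terninrete", "city": "Terni", "date": "2022-01-10"},
    {"outlet": "Corriere dell'Umbria", "city": "Perugia", "date": "2019-05-20"},
]

by_outlet = shares(toy, lambda r: r["outlet"])
by_city = shares(toy, lambda r: r["city"])
by_year = shares(toy, lambda r: r["date"][:4])  # first four characters = year

print(by_outlet)
print(by_city)
print(by_year)
```

Shares like these can then inform a decision to downsample the dominant outlets or to weight examples, depending on the use case.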

Below is an interactive visualisation of the corpus composition: